-
Notifications
You must be signed in to change notification settings - Fork 154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DataPipe] key renamer #402
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, the implementation makes sense but it feels slightly specific. I will let others chime in on their thoughts. Aside from that, just a few nit comments:
It is worth mentioning that we generally name our DataPipe following this convention:
https://github.com/pytorch/data/blob/9ad73d85b398828a919aef692c27f77182716aa1/docs/source/tutorial.rst#naming
I should add that to the contributing guide.
output = list(iter(stage2)) | ||
assert len(output) == 2 | ||
assert set(output[0].keys()) == set(["t", "b"]) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please test other boolean flags.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Please switch the order of inputs 'pattern' -> 'new name' looks more natural |
The usual usage is with keyword arguments using a simple key as output and a pattern as input. It also parallels assignment. I think this order is more useful. What do you think? |
In my opinion it makes sense to have two datapipes: |
|
||
def __init__( | ||
self, | ||
source_datapipe: IterDataPipe[List[Union[Dict, List]]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
source_datapipe: IterDataPipe[Dict[Any, Any]]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function isn't primarily as a general stream transformation, but as a quick, simple, and readable way of extracting fields for further processing in a data pipeline. That is, usually this is used for getting tar files with different file name patterns and The current API addresses that use case really well. I would recommend keeping it as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, it looks pretty good! Just a few comments. Thanks for the PR!
keep_unselected: keep keys/value pairs even if they don't match any pattern (False) | ||
must_match: all key value pairs must match (True) | ||
duplicate_is_error: it is an error if two renamings yield the same key (True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Should we move these after *args
?
WebDatasetIterDataPipe as WebDataset, | ||
) | ||
from torchdata.datapipes.iter.util.renamekeys import ( | ||
KeyRenamerIterDataPipe as RenameKeys, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KeyRenamerIterDataPipe as RenameKeys, | |
KeyRenamerIterDataPipe as KeyRenamer, |
*args, | ||
keep_unselected=False, | ||
must_match=True, | ||
duplicate_is_error=True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
duplicate_is_error=True, | |
allow_duplicate=False, |
nit: might be a better name but feel free to ignore
source_datapipe: a DataPipe yielding a stream of dictionaries. | ||
keep_unselected: keep keys/value pairs even if they don't match any pattern (False) | ||
must_match: all key value pairs must match (True) | ||
duplicate_is_error: it is an error if two renamings yield the same key (True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
duplicate_is_error: it is an error if two renamings yield the same key (True) | |
duplicate_is_error: it is an error if two renamings yield the same key (True); otherwise the first matched one will be returned |
This PR adds a filter that allows keys to be renamed in training samples represented as dictionaries. This is particularly useful for webdataset-style data sets, but can also be used with other dictionary iterators.